Introduction to Open Data Science - Course Project

Learning expectations for the course

My background

MSc in biology, currently working at SYKE, ca. 5 years of experience. Main interests: sustainability at the macro scale, national EE-IO analysis, food system sustainability modeling, circular economy.

What I’m looking for by attending

In my projects at SYKE we use R in various capacities on a daily basis. I'm expecting to expand my R skills, and maybe find new tools/packages to use in my work. That should be easier when there's some structure to the learning.

Our team also recently adopted Git, so it's nice to get a bit deeper into that too.

Additionally, the statistical component will most likely benefit me as well. I wish to expand my repertoire of analytical capabilities overall. At this point, I'm not looking to learn any single particular analysis type (since it's hard to tell what will be useful in the coming years), so it's good to have a solid reference base.

Impressions on Exercise 1 and the R for Health Data Science book

Initial impressions suggest this would have been good learning material when I first started working with R. The basics are made simple to understand, but the scope goes quite a bit deeper than that. There seems to be plenty to learn, and I'm tempted to treat this as reference material in the future too (or as a tutorial for colleagues, if I'm allowed to share it outside the course?)

## [1] "Chapter complete on: Tue Nov 21 16:29:30 2023"

2: Regression and model validation

Describe the work you have done this week and summarize your learning.

Step 1: Data wrangling

1.1: Install required packages

  • To make data retrieval more fluent, we can use the here package to make setting file paths easy.
  • here eliminates the need to set precise file paths for data retrieval or saving, as it automatically looks for the designated file relative to its “starting point path”, by default the project root directory, viewable with the command here().
if (!require(here)) {
  install.packages("here") #require() returns FALSE if the package is not available; install it in that case
}
library(here)


if (!require(tidyverse)) {
  install.packages("tidyverse")
}
library(tidyverse)

if (!require(GGally)) {
  install.packages("GGally")
}
library(GGally)

1.2 Run the script “create_learning2014.R”

  • The required data set, learning2014, is downloaded and edited in a separate script (a sketch of the call is shown below).
  • The here package automatically looks for the file relative to the path returned by here(), which defaults to the current project directory.
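A minimal sketch of how the script might be run (the Scripts/ folder is an assumption based on the repo description; the actual path may differ):

here() #Prints the starting-point path that here() resolves file paths against
source(here("Scripts", "create_learning2014.R")) #Assumed location of the wrangling script; adjust if the repo layout differs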

Step 2: Analysis phase

2.1 Graphical summary of the variables

  • First off, we produce a faceted plot of all variables.
ggpairs(Data, mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)))

Commentary

  • Considerably skewed gender distribution.
    • Exact values: 110 female, 56 male.
  • No immediately obvious patterns or clustering can be seen in the scatter plots.
    • Except maybe for column 2, where the majority of data points cluster to the left (towards low values).
  • Multiple statistically significant correlations are evident in the data.
  • Variables “Age” and “surf” appear to correlate negatively, with slight statistical significance ( . = significance at the 0.1 threshold).
  • The distribution of “Age” is strongly skewed towards low values (see the density curve at the intersection Age x Age).
    • Mostly people on the younger side are represented in the data.
  • Attitude and Points correlate positively and show a high significance level.
    • The strongest positive correlation coefficient in the data.
  • Variables “surf” and “deep” likewise correlate strongly, but in the opposite direction (negative coefficient).
    • Their correlation is the strongest negative one seen in the data.

2.2 Simple linear regression

  • For this task, we create a model that attempts to explain variation in points by the values of deep learning, attitude and age.
x<-lm(Points ~ deep+Attitude+Age, data = Data) #Create model, assign a name "x" to it
summary(x)
## 
## Call:
## lm(formula = Points ~ deep + Attitude + Age, data = Data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.206  -3.430   0.204   3.979  10.950 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 15.60773    3.38966   4.605 8.32e-06 ***
## deep        -0.60275    0.75031  -0.803    0.423    
## Attitude     0.35941    0.05696   6.310 2.56e-09 ***
## Age         -0.07716    0.05322  -1.450    0.149    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.307 on 162 degrees of freedom
## Multiple R-squared:  0.2043, Adjusted R-squared:  0.1896 
## F-statistic: 13.87 on 3 and 162 DF,  p-value: 4.305e-08

Commentary

  • Just one variable, attitude, had a statistically significant relationship with Points.
  • Next, we refine the model by removing the two non-significant variables.
  • Can these two be purged from the model without suffering a loss in explanatory power?

Model version 2 (removal of the non-significant explanatory variables)

z<-lm(Points ~ Attitude, data = Data) #Create model, assign the name "z" to it
summary(z)
## 
## Call:
## lm(formula = Points ~ Attitude, data = Data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.63715    1.83035   6.358 1.95e-09 ***
## Attitude     0.35255    0.05674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09
  • Fit remains in the same ballpark, ca. 20 %, despite the removal of the two other variables (a formal comparison of the two models is sketched below).
    • The single-variable model explains the variation in the data about as well as the first, despite being simpler.
  • The p-value indicates the likelihood of obtaining a result at least as extreme as the one observed, assuming that the null hypothesis is true (in our case, H0 would be something like: “no relationship exists between attitude and points”).
    • A low p-value lends support to the idea that we can reject H0.
    • Here, p is lower than 0.001 -> strong support for rejecting H0.
  • “Estimate” shows the regression coefficients, which indicate how strong the relationship is (simply put, a bigger value = a steeper slope).
    • When the value of attitude increases by 1 unit, the response variable increases by about 0.35.
  • The model fit statistic (R2) indicates a rather modest fit (circa 20 %). R2 describes how much of the variation in the response the fitted model accounts for. A poor fit may result if we attempt to describe a non-linear relationship using a linear model…
    • Attitude explains circa 1/5 of the variation in Points.
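As a sketch of the formal comparison referenced above (assuming the model objects x and z from the earlier chunks are still in the workspace), the nested models can be compared with an F-test:

anova(z, x) #F-test of the reduced model (Attitude only) against the full model; a large p-value suggests the dropped variables add little explanatory power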

Residuals

  • Next, we plot model residuals to assess whether or not the model fulfills prerequisite assumptions of linear regression.
  • Failing to meet these leads to unstable models, biased regression coefficients and in turn unreliable predictive power.
plot(z, which=c(1,2,5))

  • The assumptions we diagnose are:
    • Equality of variances.
      • The method assumes that variance in the data remains equal, despite parameters taking on different values.
      • In stats jargon, data that has equal variances is called “homoscedastic”, while heteroscedastic data is the opposite.
      • To diagnose this, we can look at the “Residuals vs Fitted” or “Scale-Location” plots (the latter not shown).
      • Here, if the assumption is met and variances are equal, the data points should be spread more or less evenly along the line.
      • On the other hand, if we saw e.g. a funnel-like shape (points that cluster closer together on one end of X axis than on the other end), we could say that the variances were not equal.
      • Here, most of the data points are spread relatively evenly… except for the points on the extreme right of the plot.
        • Points from the X axis value of 26 onward are grouped more tightly together.
        • The variances may not be entirely equal…
        • However, I am tempted to say that they are still “good enough”, as only a small part of the data mass seems to misbehave.
    • Linearity assumption.
      • If we want to use linear regression, the relationship we examine should be linear in form. If it is not, a change of method would be warranted.
      • This can be examined via the “Residuals vs Fitted” plot.
      • In a well-behaved plot, we look for data points spread randomly around the 0-line, in a “shotgun blast” kind of arrangement, with no observable patterns or funnel shapes (see the 1st assumption; this plot can be used for it as well). Such a distribution indicates that the relationship is indeed linear.
      • Here, all seems to be in order, as the points are randomly spread around the zero line.
    • Normality of residuals.
      • Residuals should be normally distributed.
      • Diagnosed via the Q-Q residuals plot.
      • In an ideal case, the data points should follow the line on the plot without deviating.
      • Here, the data seems mostly fine, as only the tails diverge slightly from the line. Not perfect, but good enough? 100 % normality is a bit too much to expect from real-world data.
  • Next, we can check if there are any highly influential observations in the data mass.
    • Residuals v. leverage plot highlights cases that have a high influence on model parameters. If we were to remove them, model parameters (coeffs, R2…) would change considerably.
    • The plot shows that the observation on row 71 has a high influence: removing it would change the model coefficients considerably.
    • In a real-world application, I would take a closer look at such observations and check whether their values make sense or whether there are mistakes in the data. Contextual knowledge is needed for this (a sketch of such a check follows below).
    • In each such case, a judgement call would have to be made on whether the value should stay or go.
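A minimal sketch of such a check (row 71 is used only because the leverage plot labels it, and the indexing assumes default row names; cooks.distance() and update() are standard base-R tools, not part of the original chunk):

Data[71, ] #Inspect the flagged observation itself
cooks.distance(z)[71] #Cook's distance: how strongly this single row pulls the fitted coefficients
summary(update(z, data = Data[-71, ])) #Refit without it and compare the coefficients to the original model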

Assignment 3: Logistic Regression

Install required packages

  • Note that running this procedure requires:
    • tidyverse
    • gt
    • here

Reading source data into R

  • First off, we read the required source data into R.
    • A separate script was created for this.
    • Data frame “alc” contains the necessary data
source(here("Scripts/create_data_assignment3.R"))

Forming initial hypotheses

  • In this exercise, we try to find out whether there is a connection between alcohol consumption and other variables.
  • The meanings of each variable are explained under “Additional variable information” here
  • The following 4 are chosen and initial hypotheses for the proposed relationship formed:
    • absences: could high rates of alcohol consumption be affiliated with high rates of absences from school?
    • health: hypothesis is that people consuming higher doses of alcohol may have a worse health condition than those who abstain?
    • failures: increased likelihood of course failures among heavy drinkers?
    • famrel: increased likelihood of heavy drinking leading to family conflicts (poorer family relations?)

Summary table for variables, graphical examinations

taulukko<-alc %>% group_by(high_use) %>% summarise(`Average absence` = mean(absences),
                                            `St.Dev, absences.`=sd(absences),
                                            `Average health` = mean(health),
                                            `St.Dev., health`=sd(health),
                                            `Average failure` = mean(failures),
                                            `St.Dev., failures`=sd(failures),
                                            `Average family relations` = mean(famrel),
                                            `St.Dev., family relations` = sd(famrel))
gt(taulukko) %>%
  fmt_number(decimals=2) %>% cols_align("center") %>% opt_stylize(style=1, color="green")  %>% tab_header("Summary table")
Summary table

| high_use | Average absence | St.Dev., absences | Average health | St.Dev., health | Average failure | St.Dev., failures | Average family relations | St.Dev., family relations |
|---|---|---|---|---|---|---|---|---|
| FALSE | 3.71 | 4.46 | 3.49 | 1.41 | 0.12 | 0.44 | 4.01 | 0.89 |
| TRUE | 6.38 | 7.06 | 3.73 | 1.40 | 0.35 | 0.75 | 3.77 | 0.93 |

Absence

  • On average, absence rates seem to be higher in the high alcohol consumption group than in the low consumption group (6.4 versus 3.7). There is also more variation in the high-consuming group.
    • Consistent with initial assumptions.
ggplot(alc, aes(x = high_use, y = absences)) + geom_boxplot() + ggtitle("Distribution of `absences` values by group")

Health

  • Regarding health, differences between the groups seem small.
    • Contrary to expectation, health does not seem to differ between the consumption groups.
  • Means of 3.5 versus 3.7, almost the same. SD at 1.4 in both…
ggplot(alc, aes(x = high_use, y = health)) + geom_boxplot()+ggtitle("Distribution of `health` values by group")

Failure rate

  • The failure rate is on average slightly higher in the high consumption group, as was assumed, but the difference is less pronounced than I was expecting.
  • Zero failures is the most common situation regardless of group; any other values are only marginally represented in comparison.
Frequency of `failures` values by group

| high_use | failures | n |
|---|---|---|
| FALSE | 0 | 238 |
| FALSE | 1 | 12 |
| FALSE | 2 | 8 |
| FALSE | 3 | 1 |
| TRUE | 0 | 87 |
| TRUE | 1 | 12 |
| TRUE | 2 | 9 |
| TRUE | 3 | 3 |
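No code is shown for the frequency table above; a dplyr one-liner along these lines could produce it (a sketch, not the original chunk):

alc %>% count(high_use, failures) #Counts of each `failures` value within the two high_use groups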

Family relations

  • Family relation values appear to be, on average, slightly poorer in the high alcohol consumption group.
  • This is the sort of finding I expected to see, but the magnitude of the difference between the groups is smaller than I thought it would be (a boxplot sketch follows).
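For consistency with the other variables, the group-wise distribution of famrel could also be drawn as a boxplot (a sketch; this plot is not part of the original output):

ggplot(alc, aes(x = high_use, y = famrel)) + geom_boxplot() + ggtitle("Distribution of `famrel` values by group")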

Logistic regression

Defining the GLM model, and extracting the coefficients from summary

# find the model with glm()
m <- glm(high_use ~ failures+absences+health+famrel, data = alc, family = "binomial")

# create model summary and extract the coeffs
summary(m)

coeffs<-m %>% summary() %>% coefficients()
  • All variables except family relations are positively associated with high alcohol use.
  • The strongest positive association is with failures, which is also statistically significant.
  • Absences also show a positive, statistically significant association.
  • Family relations show a negative, statistically significant association. Pretty much as I would expect.
  • Health: no statistical significance is seen in this data. This does not necessarily mean there is no link between health status and drinking, even if this data does not show it, as the body of medical evidence in the published literature indicates otherwise. Would a different set of data or a different measured metric have given a different outcome? Could the students self-assess their health status as rosier than it really is, as not all aspects of ill health are immediately obvious to the respondent (e.g. high blood pressure)?
  • Next step: odds ratios

Converting coeffs to odds ratios, calculating confidence intervals

OddsRatio<-exp(coeffs)
ConfInt<- confint(m)

Result<-cbind(OddsRatio, ConfInt)
print(Result)
##              Estimate Std. Error    z value Pr(>|z|)       2.5 %      97.5 %
## (Intercept) 0.4634471   1.790547  0.2670735 1.205335 -1.93186722  0.36139875
## failures    1.8054371   1.219527 19.6267894 1.002916  0.20659753  0.99228511
## absences    1.0848599   1.022825 36.9319701 1.000307  0.03933340  0.12816947
## health      1.1387751   1.092594  4.3383139 1.152858 -0.04154535  0.30645324
## famrel      0.7598363   1.137448  0.1185288 1.033507 -0.52884754 -0.02203822
  • These estimates are on a different scale than in the more familiar linear regression.
    • There, a coefficient can be interpreted directly: when the value of the independent variable increases by 1 unit, the dependent variable changes by the amount of the coefficient. That is not the case here.
    • These are odds-based and need to be thought of in terms of likelihood, not as a direct change in the dependent variable's value as the independent variable changes.
  • Odds ratios quantify the strength of the relationship between the dependent and independent variables.
  • If OR = 1, the odds of being a heavy drinker or not are the same, regardless of the independent variable.
  • Here, failures has the strongest link with heavy drinking. An increase of failures by 1 increases the odds of being a heavy drinker almost 2-fold (OR ≈ 1.8).
  • For absences and health, the increases in odds are less pronounced.
  • The relationship with famrel is inverted compared to the others: a one-unit increase in famrel decreases the odds of heavy drinking by about a quarter (OR ≈ 0.76). This was pretty much the expected outcome.

Predictive power

Refining the model: culling variables that showed no statistical significance

  • Health will be dropped, all others remain
m <- glm(high_use ~ failures+absences+famrel, data = alc, family = "binomial")

Predicting the probability of having the status “high use” based on the model, add as new column to alc frame

  • Here, we take the model from before and, based on it, calculate the probability that a certain student (a row in the data) has the attribute “heavy drinker”.

  • Then, we use the probabilities to make a prediction of high_use.

  • The model takes actual values from the data and, based on them, estimates how likely it is that a certain row (student) has the attribute (heavy drinker). These predictions may or may not match the actual value on each row.

alc$predicted_probabilities <- predict(m, type = "response")  #type determines which outcome is given, see ?predict


alc <- mutate(alc, prediction = predicted_probabilities > 0.5) #Flags students for whom the model gives more than a 50 % probability of being a heavy drinker: TRUE if over 0.5, FALSE if not.

Comparison of actual values (High use or not) and predicted values. How many mismatches in actual terms, how accurate were we?

Direct comparison by counts and percentages
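The objects x and y printed below are not defined in the chunks shown; presumably they were created along these lines (a sketch consistent with the output that follows):

x <- table(high_use = alc$high_use, prediction = alc$prediction) #Cross-tabulation of actual versus predicted status
y <- prop.table(x) * 100 #The same table expressed as percentages of all observations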

print(x)
##         prediction
## high_use FALSE TRUE
##    FALSE   242   17
##    TRUE     92   19
print (y)
##         prediction
## high_use     FALSE      TRUE
##    FALSE 65.405405  4.594595
##    TRUE  24.864865  5.135135
  • Rows that were in reality FALSE (as in the value of high_use) and were also predicted FALSE by the model (correctly classified) numbered 242 (65.4 %).
  • For the TRUE-TRUE case, just 19 observations were recorded (5 %). In this kind of tabulation it has to be remembered that the groups are far from equal in size: there are just 111 TRUE cases and 259 FALSE cases.

Penalty function definition

# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5 
  mean(n_wrong)
}

Idea of the function, what it does and why

  • Task: Find the proportion of “light drinkers” that the model misclassified as “heavies”

  • Set probability of being heavy drinker at 0%

  • Function takes:

    • The “class” of a student, which can be either 1 or 0, heavy drinker or not.
    • Then it takes a set probability and subtracts one from the other.
  • Say that it finds a student with a class of 0.

  • It subtracts the set probability (0) from the class value. If the class is 0 and the probability is 0, the result is also 0, which is not greater than 0.5 (FALSE). This is a correct classification, as we just set the probability of being a heavy drinker to 0.

  • If the result were a positive number, the function would have found a heavy drinker (1 - 0 = 1, which is > 0.5, TRUE) despite the probability being set to 0 (= a misclassified case).

  • A converse case: we want to find cases in which the class is 0, even though we set the probability of being a heavy drinker at 100 %.

  • So, if the probability is set to 1 (100 % likelihood of being a heavy drinker) and the function finds a class of 1, the calculation is abs(1 - 1) = 0, which is not > 0.5. We would have a correctly classified case.

  • If it finds a misclassified one, the calculation is abs(0 - 1) = 1 (the absolute value drops the minus sign), which is greater than 0.5.

  • Finally, the function calculates the mean of these misclassification indicators.

  • Here, we set the function to find students of class 0 and check whether they are mismatched as 1.

loss_func(class = alc$high_use, prob = 0)
## [1] 0.3
  • The proportion of mismatches (students that were of class 1, despite the probability of being such set at zero) was 0.3.
  • If the function found only mismatches (i.e. all 1s in the n_wrong object), the mean would be 1. The fewer mismatches, the lower the value -> a better result regarding accuracy.
loss_func(class = alc$high_use, prob = 1)
## [1] 0.7
  • The opposite case (0’s found when the probability was set at 1) gave 0.7.

  • Misclassifying 0s as 1s therefore happens more than twice as often as the opposite way (0.7 versus 0.3), simply because the FALSE group is much larger than the TRUE group (the fitted model's own training error is sketched below).
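The same loss function can also score the fitted model itself by feeding it the predicted probabilities computed earlier (a sketch; this call is not part of the original output):

loss_func(class = alc$high_use, prob = alc$predicted_probabilities) #Training-set proportion of students the model misclassifies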

Cross validation

# K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)

# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2945946
  • Compared to the model in the exercise, the predictors we now look at give a slightly higher mean prediction error, but not overwhelmingly so. The difference is rather small (circa 0.3 versus 0.26).

further delving into the bonus section if time allows…

Chapter 4: Clustering & Classification

Data wrangling

Not included here, as this time the wrangling is listed as the final assignment and deals with next week’s data. It has been completed; “create_human.R” is available in the repo under Scripts.

Installation of the necessary packages

library(tidyverse)
library(here)
library(MASS)
library(corrplot)
library(GGally)

Analysis

Data overview

  • First, we load the Boston dataset and check its dimensions & structure.
data(Boston)

str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14
  • Overall the dimensions are 506 x 14.

  • All of the 14 variables appear numerical (integer or double).

  • Next, we look up what each of the 14 vars represents.

  • The dataset has an official documentation.

    • Input command “?Boston” to view.
    • In the documentation, variable definitions are provided.
  • The data contains information on housing in the Boston region, USA.

  • Included variables are:

    • CRIM - per capita crime rate by town
    • ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
    • INDUS - proportion of non-retail business acres per town.
    • CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
    • NOX - nitric oxides concentration (parts per 10 million)
    • RM - average number of rooms per dwelling
    • AGE - proportion of owner-occupied units built prior to 1940
    • DIS - weighted distances to five Boston employment centres
    • RAD - index of accessibility to radial highways
    • TAX - full-value property-tax rate per $10,000
    • PTRATIO - pupil-teacher ratio by town
    • B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
    • LSTAT - % lower status of the population
    • MEDV - Median value of owner-occupied homes in $1000’s
  • Next, graphical and table format summaries are generated for the data

  • First, summary as a table

summary(Boston)
  • The summary values for each var are wildly different. They are all on different scales, despite being all numeric. This is why standardization over vars will soon be done, to make them comparable.
ggpairs(Boston)

  • There’s quite a lot of options on what to look at here. I’m going to cherry pick some findings, instead of going through every variable.

  • Multiple variables have skewed distributions. For example:

    • Age skews strongly towards high values -> mainly old buildings (built before 1940).
    • Variable dis (distance to employment centers) has a skew towards low values (distance to employment centers commonly small)
  • Also, several variables have bimodality (= more than 1 peak in the distribution, meaning that 2 values are more common than others)

  • Scatterplot Indus x NOx appears to show 2 distinct groupings? For most data points, both NOX concentrations and the share of industrial acreage are low (these have a strong, statistically significant correlation coeff too!)

  • Crime rate appears to have a statistically significant correlation with almost all of these vars. Seems to correlate positively with the proportion of industrial acreage, NOX concentrations, large residential land areas…

  • The distribution of “indus” (business acreage) shows bimodality: we have two peaks, indicating that a couple of values are considerably more common than others. This variable also correlates w. high statistical significance with NOX emissions, which makes sense as the variable represents the prevalence of business acreage like industry.

  • Distribution of NOX is strongly skewed towards small values.

  • Age skews strongly towards high values. Overall, most construction in the regions of the data was done prior to 1940.

  • Again, we see bimodality in property tax rates. Low and high ends of the spectrum have clear peaks.

Dataset standardization

  • As explained above, all these variables are numeric but have wildly different measurement scales. Hence, standardization.
  • Let’s print summaries of both standardized and non-standardized data and compare
boston_scaled <- as.data.frame(scale(Boston))
summary(boston_scaled)
summary(Boston)
  • This procedure changed the scales on which all the different vars are measured. Previously, they were all different as they describe very different things. Now, we have “forced” all of them to a similar scale. See for example how the scale of “black” changes: max was almost 400, way more than for any other var. Due to standardization, it became 0.44. All the vars are now on the same scale.
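For reference, scale() with its default settings subtracts each column's mean and divides by its standard deviation; a quick sketch to verify this for one column:

all.equal(boston_scaled$crim, (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)) #Should return TRUE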

Categorical crime variable creation

  • Division into bins according to quantiles; the result is set as a new variable in the old frame.
bins <- quantile(boston_scaled$crim) 

# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))


boston_scaled <- data.frame(boston_scaled, crime)

boston_scaled$crim<-NULL

Dividing the data into a training set and a testing set

n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n,  size = nrow(boston_scaled) * 0.8)

# create train set and test set
train <- boston_scaled[ind,] #Randomly selects 80% of rows, index numbers
test <- boston_scaled[-ind,] #everything except the indices in the training set

LDA

  • Fitting LDA on the training data, then plotting it.
  • Here, we seek to find a linear combination of variables that best separates the data into groups.
  • Looking at the vector directions & magnitudes tells us which LD each variable has the most effect on.
  • The variables’ relationships with each other can also be seen: parallel vectors indicate correlation, while perpendicular directions indicate the opposite.
lda.fit <- lda(crime ~ . , data = train)


lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  graphics::arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results (select both lines and execute them at the same time!)
plot(lda.fit, dimen = 2)
lda.arrows(lda.fit, myscale =2)

  • RAD has the largest impact on LD1. For LD2 it looks to be either nox or zn, albeit their magnitudes appear similar…
  • nox and rad, as well as zn and rad, are almost perpendicular to each other (roughly a 90 degree angle) -> no correlation.
  • Most of the other variables cluster very close together. None of them have vector magnitudes even close to those seen for zn, rad and nox.

Testing model prediction power

  • First, we save the “real” class values of the test set and then remove them from the frame.
  • We will use these to compare how well the model fitted on the training data classifies the actual test data.
correct_classes<-test$crime
test$crime<-NULL


# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results

table(correct = correct_classes, predicted = lda.pred$class)
  • 30 actual “lows”, only 7 of which were correctly classified. A 23 % success rate…
  • 25 “med_lows”, of which 12 are correctly classified. A 48 % success rate, better but still not good enough.
  • 26 “med_highs”, just 9 correctly placed. 34 %…
  • 21 “highs”, of which 19 were correctly placed. A 90 % success rate for this category -> high crime rates appear to be predicted very accurately, while in all other categories the performance is very lackluster (overall accuracy is computed below as a check).
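Overall accuracy across all four classes could also be computed directly (a sketch using the objects from the chunk above):

mean(correct_classes == lda.pred$class) #Share of test-set observations classified correctly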

K-Means

Clearing the environment, reloading the raw Boston data and scaling it

data(Boston)

Boston_scaled<-as.data.frame(scale(Boston))

Calculating distances between data points

dist_eu <- dist(Boston_scaled)

Run the K-means clustering algorithm on the scaled data. Here, we run it with 6 centers; optimization of K will follow.

# k-means clustering
km <- kmeans(Boston_scaled, centers = 6)
pairs(Boston_scaled[1:6], col=km$cluster)

Quite a confusing plot… a K of 6 is arguably too many.

Optimizing K

  • In the optimization, we look at the WCSS (within-cluster sum of squares).
  • A high WCSS indicates that, even though we have classified data points into the same cluster, there would still be considerable variation within that cluster.
  • We want to find a K value for which the variation within each group is as small as possible (i.e. the clusters should be formed with the idea of grouping similar data points together, with as little “wild” variation between points in the same group as possible).
  • Visually, we check the plotted curve and observe at which point (K value) the most “benefit” has been gained from increasing K.
  • The WCSS curve is steep and decreases rapidly up to a K of circa 2. After that, the slope grows considerably milder -> even if we increase K, the resulting “gains” (or here, losses, as we want small WCSS values) become less and less pronounced.
    • So, after K = circa 2, it is not really worth it to add any more centers…
set.seed(123)

# determine the number of clusters
k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(Boston_scaled, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
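Since qplot() is deprecated, the same elbow plot could be drawn with ggplot() (a sketch using the objects defined above):

ggplot(data.frame(k = 1:k_max, twcss = twcss), aes(x = k, y = twcss)) + geom_line() #Total within-cluster sum of squares versus the number of clusters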

### Re-running analysis with K set to 2

km <- kmeans(Boston_scaled, centers = 2)

pairs(Boston_scaled[1:5], col = km$cluster)

pairs(Boston_scaled[6:10], col = km$cluster)

  • With K at 2, it’s a lot easier to see what’s happening.
  • Let’s zoom into some of these:
  • Multiple distinct groupings. For example:
    • Black data points in the tax - age pairplot: cluster of properties high in both tax rate and age
pairs(Boston_scaled[c(1,10)], col = km$cluster)

  • Crime rates and high property tax values (wealth indicator?) appear to cluster separately. Wealthier neighbourhoods, less crime?
pairs(Boston_scaled[c(3,5)], col = km$cluster)

  • Industrial zoning and nitrogen oxide concentrations cluster as well. A distinct grouping of black data points shows high concentrations where the share of industrial zoning is also high; and where the zoning is not industrial, concentrations are also low.
pairs(Boston_scaled[c(7,14)], col = km$cluster)

  • Housing value and age appear to show a large cluster in black, with high age (old housing) and low value.
pairs(Boston_scaled[c(1,14)], col = km$cluster)

## Bonus

  • Load original data, standardize, and run K means with 3 clusters
data("Boston")

Boston_scaled<-as.data.frame(scale(Boston))


km <- kmeans(Boston_scaled, centers = 3)
  • Then, fit LDA with clusters as the target (dependent)
lda.fit <- lda(km$cluster ~ . , data = Boston_scaled)

lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  graphics::arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(km$cluster)

# plot the lda results (select both lines and execute them at the same time!)
plot(lda.fit, dimen = 2)
lda.arrows(lda.fit, myscale =2.4)

#More zoom
plot(lda.fit, dimen = 2)
lda.arrows(lda.fit, myscale =20)

  • Now, location on the riverside looks to be the main determinant for group 1. For group 3, it appears to be accessibility to radial highways. The remaining variables cluster together in a single mass, with no single one standing out.
  • In the more zoomed plot, LSTAT looks to be important for group 3, while zn is important for group 2.

Super Bonus

Load plotly

library(plotly)
## Warning: package 'plotly' was built under R version 4.2.3
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout

Calculate matrix products

lda.fit <- lda(crime ~ . , data = train)

model_predictors <- dplyr::select(train, -crime)

dim(model_predictors)
dim(lda.fit$scaling)

matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
  • 3D Plot

  • Colored by the crime classes of the training data, we get the pink “High crime” data points grouped on their own (some overlap with med-high, though)

plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color= train$crime)

Data wrangling for next week

  • This has been completed; “create_human.R” is available in the repo under Scripts and its output under Data.

(more chapters to be added similarly as we proceed with the course!)